AITopics | grade score

Grade Score: Quantifying LLM Performance in Option Selection

Iourovitski, Dmitri

arXiv.org Artificial Intelligence, Jun-20-2024

Large Language Models (LLMs) have demonstrated remarkable intelligence and versatility in tasks related to logic, reasoning, and grading [4, 1, 7]. This has led to the increasing use of LLMs as judges of arbitrary user-presented options and, at times, as judges of other LLMs [11, 12]. However, previous research has shown that LLMs exhibit biases, including a tendency to favor the first option presented to them. This paper explores various methods to mitigate order bias and improve the consistency of LLM judging. To facilitate progress in the study of LLM biases and consistency, we introduce a novel metric called the Grade Score, designed to quantify both the selection consistency and the order bias exhibited by an LLM, providing a comprehensive measure of its judging performance. A high score indicates a model that is highly consistent and fair with respect to option order, while a low score suggests significant order bias or inconsistency in the model's choices. The Grade Score serves as a tool for researchers and practitioners to assess and compare the performance of different LLMs in judging tasks. By quantifying the degree of instability and bias, it enables the identification of models with superior judging capabilities and supports the development of techniques to mitigate biases and improve consistency.
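
The abstract above does not state the Grade Score formula, so the following is only an illustrative sketch of how order bias and selection consistency could be quantified together. It assumes a judge is shown the same options in several shuffled orders, and that for each presentation we record the position it picked and the identity of the option it picked. The function names, the normalized-entropy measure, and the harmonic-mean combination are all assumptions made for this example, not the paper's definition.

```python
import math
from collections import Counter


def position_entropy(positions, n_options):
    """Normalized entropy of the chosen positions, in [0, 1].

    1.0 means every position is picked equally often (no order bias);
    0.0 means the judge always picks the same position.
    Assumes n_options > 1.
    """
    total = len(positions)
    counts = Counter(positions)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n_options)


def mode_frequency(choices):
    """Share of presentations in which the most common option was chosen."""
    counts = Counter(choices)
    return counts.most_common(1)[0][1] / len(choices)


def illustrative_score(positions, choices, n_options):
    """Harmonic mean of the two measures (a hypothetical combination)."""
    e = position_entropy(positions, n_options)
    m = mode_frequency(choices)
    if e + m == 0:
        return 0.0
    return 2 * e * m / (e + m)


# A judge that picks option "A" wherever it appears: consistent, unbiased.
fair = illustrative_score([0, 1, 2, 3], ["A", "A", "A", "A"], n_options=4)

# A judge that always picks the first slot, whatever is in it: order-biased.
biased = illustrative_score([0, 0, 0, 0], ["A", "B", "C", "D"], n_options=4)
```

Under these assumptions the first judge scores near the top of the range and the second near the bottom, which matches the qualitative reading of high and low scores given in the abstract.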

grade score, llm, order bias, (15 more...)

2406.12043

Country:

North America > United States > Illinois > Champaign County > Urbana (0.04)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Investigating Automatic Scoring and Feedback using Large Language Models

Katuka, Gloria Ashiya, Gain, Alexander, Yu, Yen-Yun

arXiv.org Artificial Intelligence, May-1-2024

Automatic grading and feedback have long been studied using traditional machine learning and deep learning techniques based on language models. With the recent accessibility of high-performing large language models (LLMs) like LLaMA-2, there is an opportunity to investigate the use of these LLMs for automatic grading and feedback generation. Despite the increase in performance, LLMs require significant computational resources for fine-tuning, as well as additional task-specific adjustments to enhance their performance. To address these issues, Parameter Efficient Fine-tuning (PEFT) methods, such as LoRA and QLoRA, have been adopted to decrease memory and computational requirements in model fine-tuning. This paper explores the efficacy of PEFT-based quantized models, employing a classification or regression head, to fine-tune LLMs for automatically assigning continuous numerical grades to short answers and essays, as well as generating corresponding feedback. We conducted experiments on both proprietary and open-source datasets for our tasks. The results show that prediction of grade scores via fine-tuned LLMs is highly accurate, achieving less than 3% error in grade percentage on average. For providing graded feedback, fine-tuned 4-bit quantized LLaMA-2 13B models outperform competitive base models and achieve high similarity with subject matter expert feedback, as measured by BLEU and ROUGE scores and by qualitative assessment of the feedback. The findings from this study provide important insights into the impact of using quantization approaches to fine-tune LLMs for various downstream tasks, such as automatic short answer scoring and feedback generation, at comparatively lower cost and latency.
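
To make the memory argument for LoRA and QLoRA concrete, here is a rough back-of-the-envelope sketch. It relies on the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in is adapted by two trainable low-rank factors of shapes d_out x r and r x d_in. The layer size, rank, and helper names below are illustrative assumptions rather than settings reported in the abstract, and the memory estimate ignores quantization constants, activations, and optimizer state.

```python
def full_params(d_in, d_out):
    """Parameters in one dense weight matrix (fully fine-tuned)."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters when only the two low-rank factors are updated."""
    return rank * (d_in + d_out)


def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Illustrative (assumed) values: one 5120 x 5120 projection, rank 8.
d_in, d_out, rank = 5120, 5120, 8
dense = full_params(d_in, d_out)
adapter = lora_trainable_params(d_in, d_out, rank)
trainable_fraction = adapter / dense

# Base-weight storage for a 13-billion-parameter model at two precisions.
fp16_gb = weight_memory_gb(13e9, 16)
int4_gb = weight_memory_gb(13e9, 4)

print(f"trainable fraction per layer: {trainable_fraction:.4%}")
print(f"base weights: {fp16_gb:.1f} GB at 16-bit vs {int4_gb:.1f} GB at 4-bit")
```

The point of the sketch is only the order of magnitude: the adapter trains a small fraction of each layer's parameters, and storing the frozen base weights in 4 bits cuts their footprint to roughly a quarter of the 16-bit size.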

dataset, feedback generation, language model, (16 more...)

2405.00602

Country: Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Educational Setting > Online (0.69)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
